{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Splitting a Protein Dataset into Training and Test Sets\n",
    "\n",
    "Constructing careful splits of protein datasets can be tricky due to sequence homology. Graphein can take care of this for you.\n",
    "\n",
    "We use BLAST to cluster sequences based on similarity. Disjoint training and test sets can the be constructed from these clusters.\n",
    "\n",
    "\n",
    "We'll run through a small example of 4 sequences which we split into two equally sized training and test sets at 25% identity. We note that this is really the bare minimum one can prevent data leakage due to homology. Here we only account for sequence homology, however even proteins with 0% sequence identity can adopt very similar folds. Features for spltting based on SCOP and CATH annotations are priorities on our development roadmap. For a fuller discussion, see David Jones' excellent treatment of the potential pitfalls when working with machine learning in biology:\n",
    "\n",
    "> Setting the standards for machine learning in biology\n",
    "> David T. Jones\n",
    "> Nature Review Molecular Cell Biology\n",
    "> https://www.nature.com/articles/s41580-019-0176-5\n",
    "\n",
    "\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/a-r-j/graphein/blob/master/notebooks/splitting_a_dataset.ipynb)\n",
    "[![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/a-r-j/graphein/blob/master/notebooks/splitting_a_dataset.ipynb)\n",
    "\n",
    "\n",
    "## Requirements\n",
    "This functionality relies on BLAST. On linux, you can install it with:\n",
    "\n",
    "```bash\n",
    "sudo apt install ncbi-blast+\n",
    "```\n",
    "\n",
    "Otherwise, please see: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install graphein if necessary:\n",
    "# !pip install graphein\n",
    "\n",
    "# Install blast if necessary (linux):\n",
    "# !sudo apt install ncbi-blast+"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building the Dataset FASTA\n",
    "First, we need to assemble our sequences into a FASTA file that contains all of our queries.\n",
    "\n",
    "We can either do this based on a mapping of our own creation from PDBs to sequences or from a list of graphs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from graphein.ml.clustering import build_fasta_file_from_mapping\n",
    "\n",
    "pdb_sequence_mapping = {\n",
    "    \"3eiy\": \"SFSNVPAGKDLPQDFNVIIEIPAQSEPVKYEADKALGLLVVDRFIGTGMRYPVNYGFIPQTLSGDGDPVDVLVITPFPLLAGSVVRARALGMLKMTDESGVDAKLVAVPHDKVCPMTANLKSIDDVPAYLKDQIKHFFEQYKALEKGKWVKVEGWDGIDAAHKEITDGVANFKK\",\n",
    "    \"1lds\": \"MIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWD\",\n",
    "    \"4hhb\": \"VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR\",\n",
    "    \"7wda\": \"GLVVSFYTPATDGATFTAIAQRCNQQFGGRFTIAQVSLPRSPNEQRLQLARRLTGNDRTLDVMALDVVWTAEFAEAGWALPLSDDPAGLAENDAVADTLPGPLATAGWNHKLYAAPVTTNTQLLWYRPDLVNSPPTDWNAMIAEAARLHAAGEPSWIAVQANQGEGLVVWFNTLLVSAGGSVLSEDGRHVTLTDTPAHRAATVSALQILKSVATTPGADPSITRTEEGSARLAFEQGKAALEVNWPFVFASMLENAVKGGVPFLPLNRIPQLAGSINDIGTFTPSDEQFRIAYDASQQVFGFAPYPAVAPGQPAKVTIGGLNLAVAKTTRHRAEAFEAVRCLRDQHNQRYVSLEGGLPAVRASLYSDPQFQAKYPMHAIIRQQLTDAAVRPATPVYQALSIRLAAVLSPITEIDPESTADELAAQAQKAIDG\"\n",
    "    }\n",
    "\n",
    "build_fasta_file_from_mapping(pdb_sequence_mapping, \"sequences.fasta\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Reading PDB file... <span style=\"color: #3a3a3a; text-decoration-color: #3a3a3a\">━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span> <span style=\"color: #800080; text-decoration-color: #800080\">  0%</span> <span style=\"color: #008080; text-decoration-color: #008080\">-:--:--</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Reading PDB file... \u001b[38;5;237m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[35m  0%\u001b[0m \u001b[36m-:--:--\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# We could also build this mapping from a list of graphs\n",
    "import graphein.protein as gp\n",
    "from graphein.ml.clustering import build_fasta_file_from_graphs\n",
    "\n",
    "# Build graphs\n",
    "graphs = [gp.construct_graph(pdb_code=code) for code in [\"3eiy\", \"1lds\", \"4hhb\", \"7wda\"]]\n",
    "\n",
    "# Build fasta\n",
    "build_fasta_file_from_graphs(graphs, fasta_out=\"sequences.fasta\", chains=[\"A\", \"A\", \"A\", \"A\"]) # Chain param lets us select a specific chain in a structure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">3eiy_A\n",
      "SFSNVPAGKDLPQDFNVIIEIPAQSEPVKYEADKALGLLVVDRFIGTGMRYPVNYGFIPQTLSGDGDPVDVLVITPFPLLAGSVVRARALGMLKMTDESGVDAKLVAVPHDKVCPMTANLKSIDDVPAYLKDQIKHFFEQYKALEKGKWVKVEGWDGIDAAHKEITDGVANFKK\n",
      ">1lds_A\n",
      "MIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWD\n",
      ">4hhb_A\n",
      "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR\n",
      ">7wda_A\n",
      "GLVVSFYTPATDGATFTAIAQRCNQQFGGRFTIAQVSLPRSPNEQRLQLARRLTGNDRTLDVMALDVVWTAEFAEAGWALPLSDDPAGLAENDAVADTLPGPLATAGWNHKLYAAPVTTNTQLLWYRPDLVNSPPTDWNAMIAEAARLHAAGEPSWIAVQANQGEGLVVWFNTLLVSAGGSVLSEDGRHVTLTDTPAHRAATVSALQILKSVATTPGADPSITRTEEGSARLAFEQGKAALEVNWPFVFASMLENAVKGGVPFLPLNRIPQLAGSINDIGTFTPSDEQFRIAYDASQQVFGFAPYPAVAPGQPAKVTIGGLNLAVAKTTRHRAEAFEAVRCLRDQHNQRYVSLEGGLPAVRASLYSDPQFQAKYPMHAIIRQQLTDAAVRPATPVYQALSIRLAAVLSPITEIDPESTADELAAQAQKAIDG\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Inspect the FASTA file:\n",
    "with open(\"sequences.fasta\", \"r\") as f:\n",
    "    print(f.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Clustering proteins by sequence identity\n",
      "----------------------------------------\n",
      "Parameters:\n",
      "    - Number of sets to split: 1\n",
      "    - Sequence identity low threshold: 25.0%\n",
      "    - Fraction of sequences in test set: 0.5\n",
      "\n",
      "\n",
      "*** Creating sequences pairs for clustering\n",
      "\n",
      "\n",
      "Building a new DB, current time: 05/18/2022 18:38:39\n",
      "New DB name:   /home/atj39/github/graphein/notebooks/sequences.fasta\n",
      "New DB title:  sequences.fasta\n",
      "Sequence type: Protein\n",
      "Keep MBits: T\n",
      "Maximum file size: 1000000000B\n",
      "Adding sequences from FASTA; added 4 sequences in 0.000277042 seconds.\n",
      "*** Clustering from pairs of sequences\n",
      "*** Generating sets\n",
      "4 ids in 4 clusters\n",
      "Generated set LR_00 and TS_00 - 2 sequences for 2 clusters in test, among (4, 4) in total. Test relative size is 0.5\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['4hhb', '3eiy']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from graphein.ml.clustering import train_and_test_from_fasta\n",
    "\n",
    "train_and_test_from_fasta(fasta_file=\"sequences.fasta\", number_of_sets=1, fraction_in_test=0.5,\n",
    "                            cluster_file_name='s2d_clusters.txt', seq_id_low_thresh=25.,\n",
    "                              use_very_loose_condition=False, n_cpu=2,\n",
    "                              max_target_seqs=200, delete_db_when_done=True,\n",
    "                              train_set_key='LR', test_set_key='TS', early_break=True\n",
    "                              )\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inspecting the split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train Data:\n",
      "1lds\n",
      "7wda\n",
      "\n",
      "Test Data:\n",
      "4hhb\n",
      "3eiy\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Train Data\n",
    "print(\"Train Data:\")\n",
    "with open(\"LR_00\", \"r\") as f:\n",
    "    print(f.read())\n",
    "\n",
    "# Test Data\n",
    "print(\"Test Data:\")\n",
    "with open(\"TS_00\", \"r\") as f:\n",
    "    print(f.read())"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "0ab7f988027852efc1ebacd06db3f130eb65d2a20cb6a366311359132c20a952"
  },
  "kernelspec": {
   "display_name": "Python 3.8.13 ('graphein')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
